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TRAJECTORY AVERAGING FOR STOCHASTIC 
APPROXIMATION MCMC ALGORITHMS 1 

By Faming Liang 

Texas A&M University 

The subject of stochastic approximation was founded by Rob- 
bins and Monro [Ann. Math. Statist. 22 (1951) 400-407]. After five 
decades of continual development, it has developed into an important 
area in systems control and optimization, and it has also served as a 
prototype for the development of adaptive algorithms for on-line esti- 
mation and control of stochastic systems. Recently, it has been used 
in statistics with Markov chain Monte Carlo for solving maximum 
likelihood estimation problems and for general simulation and opti- 
mizations. In this paper, we first show that the trajectory averaging 
estimator is asymptotically efficient for the stochastic approxima- 
tion MCMC (SAMCMC) algorithm under mild conditions, and then 
apply this result to the stochastic approximation Monte Carlo algo- 
rithm [Liang, Liu and Carroll J. Amer. Statist. Assoc. 102 (2007) 
305-320]. The application of the trajectory averaging estimator to 
other stochastic approximation MCMC algorithms, for example, a 
stochastic approximation MLE algorithm for missing data problems, 
is also considered in the paper. 

1. Introduction. Robbins and Monro (1951) introduced the stochastic 
approximation algorithm to solve the integration equation 

(1) h(6)= [ H(0,x)f e (x)dx = O, 

Jx 

where 9 G G C R rfe is a parameter vector and fg(x), x S X C M. dx , is a density 
function depending on 9. The dg and d x denote the dimensions of 9 and x, 
respectively. The stochastic approximation algorithm is an iterative recursive 
algorithm, whose each iteration consists of two steps: 
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Stochastic approximation algorithm. 

• Generate X k +i ~ fg k (x), where k indexes the iteration. 

• Set Ok+i = 9 k + o-kH(9 k , Xk + i), where a k > is called the gain factor. 

The stochastic approximation algorithm is often studied by rewriting it 
as follows: 

(2) 9 k+1 = 9 k + a k [h(9 k ) + s k+1 ], 

where h(9 k ) = J x H(9 k ,x)fg k (x) alx corresponds to the mean effect of H(9 k , 
X k +i), and Sk+i = H(9 k , X k +i) — h(9 k ) is called the observation noise. In 
the literature of stochastic approximation, h(9) is also called the mean field 
function. It is well known that the optimal convergence rate of (2) can be 
achieved with a k = —F~ 1 /k, where F = dh(9*) j 89 , and 9* denotes the zero 
point of h(9). In this case, (2) is reduced to Newton's algorithm. Unfor- 
tunately, it is often impossible to use this algorithm, as the matrix F is 
generally unknown. 

Although an optimal convergence rate of 9 k cannot be obtained in gen- 
eral, in a sequence of fundamental papers Ruppert (1988), Polyak (1990) and 
Polyak and Juditsky (1992) showed that the trajectory averaging estimator 
is asymptotically efficient; that is, 9 n = Ylk=i ®k/ n can converge in distribu- 
tion to a normal random variable with mean 9* and covariance matrix E, 
where E is the smallest possible covariance matrix in an appropriate sense. 
The trajectory averaging estimator requires {a k } to be relatively large, de- 
creasing slower than 0(l/k). As discussed by Polyak and Juditsky (1992), 
trajectory averaging is based on a paradoxical principle: a slow algorithm 
having less than optimal convergence rate must be averaged. 

Recently, the trajectory averaging technique has been further explored in 
a variety of papers [see, e.g., Chen (1993), Kushner and Yang (1993, 1995), 
Dippon and Renz (1997), Wang, Chong and Kulkarni (1997), Tang, L'Ecuyer 
and Chen (1999), Pelletier (2000) and Kushner and Yin (2003)] with differ- 
ent assumptions for the observation noise. However, up to our knowledge, 
it has not yet been explored for stochastic approximation MCMC (SAM- 
CMC) algorithms [Benveniste, Metivier and Priouret (1990), Chen (2002), 
Kushner and Yin (2003), Andrieu, Moulines and Priouret (2005), Andrieu 
and Moulines (2006)]. The stochastic approximation MCMC algorithms re- 
fer to a class of stochastic approximation algorithms for which the sample is 
generated at each iteration via a Markov transition kernel; that is, {xft+i} 
is generated via a family of Markov transition kernel {Pg k (x k , •)} controlled 
by {9k}- Recently, the stochastic approximation MCMC algorithms have 
been used in statistics for solving maximum likelihood estimation problems 
[Younes (1989, 1999), Moyeed and Baddeley (1991), Gu and Kong (1998), 
Gu and Zhu (2001)], and for general simulation and optimizations [Liang, 
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Liu and Carroll (2007), Atchade and Liu (2010)]. It is worth to point out 
that in comparison with conventional MCMC algorithms, for example, the 
Metropolis-Hastings algorithm [Metropolis et al. (1953), Hastings (1970)], 
parallel tempering [Geyer (1991)], and simulated tempering [Marinari and 
Parisi (1992), Geyer and Thompson (1995)], the stochastic approximation 
Monte Carlo (SAMC) algorithm [Liang, Liu and Carroll (2007)] has signif- 
icant advantages in simulations of complex systems for which the energy 
landscape is rugged. As explained later (in Section 3), SAMC is essentially 
immune to the local trap problem due to its self-adaptive nature inher- 
ited from the stochastic approximation algorithm. SAMC has been suc- 
cessfully applied to many statistical problems, such as p- value evaluation 
for resampling-based tests [Yu and Liang (2009)], Bayesian model selection 
[Liang (2009), Atchade and Liu (2010)] and spatial model estimation [Liang 
(2007a)], among others. 

In this paper, we explore the theory of trajectory averaging for stochas- 
tic approximation MCMC algorithms, motivated by their wide applications. 
Although Chen (1993, 2002) considered the case where the observation noise 
can be state dependent, that is, the observation noise £k+i depends on 
9q, ... ,9^, their results are not directly applicable to the stochastic approx- 
imation MCMC algorithms due to some reasons as explained in Section 5. 
The theory established by Kushner and Yin (2003) can potentially be ex- 
tended to the stochastic approximation MCMC algorithm, but, as mentioned 
in Kushner and Yin [(2003), page 375] the extension is not straightforward 
and more work needs to be done to deal with the complicated structure of the 
Markov transition kernel. In this paper, we propose a novel decomposition of 
the observation noise for the stochastic approximation MCMC algorithms. 
Based on the proposed decomposition, we show the trajectory averaging es- 
timator is asymptotically efficient for the stochastic approximation MCMC 
algorithms, and then apply this result to the SAMC algorithm. These re- 
sults are presented in Lemma A. 5, Theorems 2.3 and 3.2, respectively. The 
application of the trajectory averaging technique to other stochastic approx- 
imation MCMC algorithms, for example, a stochastic approximation MLE 
algorithm for missing data problems, is also considered in the paper. 

The remainder of this paper is organized as follows. In Section 2, we 
present our main theoretical result that the trajectory averaging estimator 
is asymptotically efficient for the stochastic approximation MCMC algo- 
rithms. In Section 3, we apply the trajectory averaging technique to the 
SAMC algorithm. In Section 4, we apply the trajectory averaging technique 
to a stochastic approximation MLE algorithm for missing data problems. In 
Section 5, we conclude the paper with a brief discussion. 
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2. Trajectory averaging for a general stochastic approximation MCMC 
algorithm. 

2.1. A varying truncation stochastic approximation MCMC algorithm. 
To show the convergence of the stochastic approximation algorithm, restric- 
tive conditions on the observation noise and mean field function are required. 
For example, one often assumes the noise to be mutually independent or to 
be a martingale difference sequence, and imposes a sever restriction on the 
growth rate of the mean field function. These conditions are usually not sat- 
isfied in practice. See Chen [(2002), Chapter 1] for more discussions on this 
issue. To remove the growth rate restriction on the mean field function and 
to weaken the conditions imposed on noise, Chen and Zhu (1986) proposed a 
varying truncation version for the stochastic approximation algorithm. The 
convergence of the modified algorithm can be shown for a wide class of the 
mean filed function under a truly weak condition on noise; see, for exam- 
ple, Chen, Guo and Gao (1988) and Andrieu, Moulines and Priouret (2005). 
The latter gives a proof for the convergence of the modified algorithm with 
Markov state-dependent noise under some conditions that are easy to verify. 

Following Andrieu, Moulines and Priouret (2005), we consider the fol- 
lowing varying truncation stochastic approximation MCMC algorithm. Let 
{/C s , s > 0} be a sequence of compact subsets of such that 

(3) (J/C s = e and K s C int(/C s+ i), s > 0, 

s>0 

where int(^4) denotes the interior of set A. Let {a k } and {b k } be two mono- 
tone, nonincreasing, positive sequences. Let Xq be a subset of X, and let 
T : X x — > Xq x /Co be a measurable function which maps a point (x,9) 
in X x to a random point in Xq x ICq; that is, both x and will be 
reinitialized in Xq x ICq. As shown in Lemma A. 5, for the stochastic approx- 
imation MCMC algorithm, when the number of iterations becomes large, 
the observation noise e k can be decomposed as 

(4) e k = e k + v k + <; k , 

where {e k } forms a martingale difference sequence, and the expectation of 
the other two terms will go to zero in certain forms. In Theorems 2.2 and 
2.3, we show that {e^} leads to the asymptotic normality of the trajectory 
averaging estimator 9 k , and {v k } and {q k } can vanish or be ignored when 
the asymptotic distribution of 6 k is considered. 

Let a k denote the number of truncations performed until iteration k and 
do = 0. The varying truncation stochastic approximation MCMC algorithm 
starts with a random choice of (6q, %o) in the space ICq x Xq, and then iterates 
between the following steps: 
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Varying truncation stochastic approximation MCMC algorithm. 

• Draw sample x k+ \ with a Markov transition kernel, Pg , which admits 
fg k (x) as the invariant distribution. 

• Set 6 k+1/2 = k + a k H(9 k ,x k+1 ). 

• If ||#fc+i/2 ~ 9 k \\ < bk and 0^+1/2 £ ^-o- fe; where ||z|| denote the Euclidean 
norm of the vector z, then set (9 k+ i,x k+ \) = (0fc+i/2> ^fc+i) and <7fc + i = o^; 
otherwise, set (6> fc+ i, x k+1 ) = T(9 k ,x k ) and <7 fc+ i =a k + 1. 

As depicted by the algorithm, the varying truncation mechanism works in 
an adaptive manner as follows: when the current estimate of the parameter 
wanders outside the active truncation set or when the difference between 
two successive estimates is greater than a time-dependent threshold, then 
the algorithm is reinitialized with a smaller initial value of the gain factor 
and a larger truncation set. This mechanism enables the algorithm to select 
an appropriate gain factor sequence and an appropriate starting point, and 
thus to confine the recursion to a compact set; that is, the number of reini- 
tializations is almost surely finite for every (#0,^0) £^o x ^O' This result is 
formally stated in Theorem 2.1, which plays a crucial role for establishing 
asymptotic efficiency of the trajectory averaging estimator. 

Regarding the varying truncation scheme, one can naturally propose many 
variations. For example, one may not change the truncation set when only 
the condition ||#fc + i/2 — 9 k \\ < b k is violated, and, instead of jumping forward 
in a unique gain factor sequence, one may start with a different gain factor 
sequence (smaller than the previous one) when the reinitialization occurs. 
In either case, the proof for the theorems presented in Section 2.2 follows 
similarly. 

2.2. Theoretical results on the trajectory averaging estimator. The asymp- 
totic efficiency of 9 k can be analyzed under the following conditions. 

Lyapunov condition on h(9). Let {x,y) denote the Euclidean inner prod- 
uct. 

(Ai) is an open set, the function h : — > M. d is continuous, and there exists 
a continuously differentiable function v : — > [0, 00) such that: 

(i) There exists Mq > such that 

(5) £ = {#£©, (Vv(9),h(9)) = 0} C {9 g 0, v(9) < M }. 

(ii) There exists Mi £ (Mo, 00) such that Vmi is a compact set, 
where V M = {9 G 0,v(9) < M}. 

(iii) For any #£ 0\£, (Vv (9), h{9)) < 0. 

(iv) The closure of v(C) has an empty interior. 
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This condition assumes the existence of a global Lyapunov function v for 
the mean field h. If h is a gradient field, that is, h = — V J for some lower 
bounded real- valued and differentiable function J (9), then v can be set to J, 
provided that J is continuously differentiable. This is typical for stochastic 
optimization problems, for example, machine learning [Tadic (1997)], where 
a continuously differentiable objective function J{9) is minimized. 

Stability condition on h(9). 

(A2) The mean field function h(9) is measurable and locally bounded. There 
exist a stable matrix F (i.e., all eigenvalues of F are with negative real 
parts), 7 > 0, p G (0, 1], and a constant c such that, for any 9* G C, 

\\h(9) - F{9 - 9*)\\ <c\\9-9*\\ 1+p V0 6{0:||0-0*|| < 7}, 

where C is defined in (5). 

This condition constrains the behavior of the mean field function around 
the solution points. It makes the trajectory averaging estimator sensible 
both theoretically and practically. If h{9) is differentiable, the matrix F can 
be chosen to be the partial derivative of h(9), that is, dh(9)/d9. Otherwise, 
certain approximation may be needed. 



Drift condition on the transition kernel Pq . Before giving details of this 
condition, we first define some terms and notation. Assume that a transition 
kernel Pq is irreducible, aperiodic, and has a stationary distribution on a 
sample space denoted by X . A set C C X is said to be small if there exist a 
probability measure v on X , a positive integer I and 5 > such that 

Pg(x,A) >5v{A) VxG C^AeBx, 

where Bx is the Borel set defined on X. A function V : X — > [l,oo) is said 
to be a drift function outside C if there exist positive constants A < 1 and b 
such that 

P$V(x) < XV (x) + bl(x G C) Vx G X, 

where PqV(x) = f x Po(x,y)V(y)dy. For a function g:X — >M. d , define the 
norm 

11 11 ||g(s)|| 
\\g\\ v = sup — 

and define the set Ly = {g: X — > W d ,sup xe x\\g\\v < 00}. Given the terms 
and notation introduced above, the drift condition can be specified as fol- 
lows. 
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(A3) For any given 6 G O, the transition kernel Pq is irreducible and aperi- 
odic. In addition, there exists a function V : X — > [1, 00) and a constant 
a > 2 such that for any compact subset K, C 0: 

(i) There exist a set C C X, an integer /, constants < A < 1, b, 
5 > and a probability measure 1/ such that 

(6) supP^V a (x) < \V a {x) + bI(xeC) VxeX, 

(7) supP e V a (x) <qV a (x) Vx£X, 
0G/C 

(8) inf P l e {x,A) >5u(A) \/x eC,\/A<EB x . 

(ii) There exists a constant c > such that, for all x & X, 

(9) SUp||iT(0,x)||y <c, 

eeK. 

(10) sup ||#(0,aO-i7(0',x)||y <c||0-0'||. 
(6,6')&C 

(iii) There exists a constant c > such that, for all (0,9') S /C x /C, 

(11) ||P e5 -P^||^<c|| 5 ||y||0-0'|| V 5 G£y, 

(12) \\Pe9-Po'9\\v"<c\\g\\v40-0'\\ V<7G£ yQ . 
Assumption (A3)(i) is classical in the literature of Markov chain. It implies 

the existence of a stationary distribution fg(x) for all 6 € O and y a -uniform 
ergodicity [Andrieu, Moulines and Priouret (2005)]. Assumption (As)(ii) 
gives conditions on the bound of H(6,x). This is a critical condition for the 
observation noise. As seen later in Lemmas A.l and A. 5, it directly leads 
to the boundedness of some terms decomposed from the observation noise. 
For some algorithms, for example, SAMC, for which H(8,x) is a bounded 
function, the drift function can be simply set as V(x) = 1. 

Conditions on the step-sizes. 

(A4) The sequences {a^} and are nonincreasing, positive and satisfy 
the conditions: 



(13) 



> afc = 00, hm (kak) = 00, 
k=l 

- = o(a fe+ i), b k = 0(a> k ) 



ak 

for some r 6 (0,1], 

00 (l+r)/2 

(14) E 



< 00, 
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and for some constants a > 2 as defined in condition (A3), 

00 

(15) ]T{aA + < 00. 

i=l 

It follows from (14) that 

k (l+r)/2 
i=[fc/2] V 

where [z] denotes the integer part of z. Since a k is nonincreasing, we have 

k 

(l+r)/2 1 /-1 \ 

i=[fc/2] V ' 

and thus a^ 1 " 1 " \/& = o(l), or = 0(k~ v ) for 77 G (^, 1). For instance, = 
Ci/k 11 for some constants C\ > and 77 G (^, 1), then we can set b k = C^/fc^ 
for some constants C2 > and £ G (^,77 — ^), which satisfies (13) and (15). 
Under this setting, the existence of r is obvious. 

Theorem 2.1 concerns the convergence of the general stochastic approx- 
imation MCMC algorithm. The proof follows directly from Theorems 5.4, 
5.5 and Proposition 6.1 of Andrieu, Moulines and Priouret (2005). 

Theorem 2.1. Assume conditions (Ai), (A3) and (A4) hold. Let k a de- 
note the iteration number at which the ath truncation occurs in the stochastic 
approximation MCMC simulation. Let Xq C X be such that sup^g^ V(x) < 
00 and that /Co C Vm , where Vm ^ s defined in (Ai). Then there exists al- 
most surely a number, denoted by o~ s , such that k as < 00 and fe^+i = 00; 
that is, {9k} has no truncation for k > k as , or mathematically, 

9 k+1 = 9 k + a k H(6 k ,x k+1 ) Mk>k 

Ln addition, we have 



9 k ->9* a.s. 

for some point 9* G C 

Theorem 2.2 concerns the asymptotic normality of 9 k . 

Theorem 2.2. Assume conditions (Ai), (A 2 ), (A3) and (A 4 ) hold. Let 
Xq C X be such that sup^g^ V(x) < 00 and that /Co C Vm , where Vm ^ 
defined in (Ai). Then 

y/k(0 h -0*)—>N(O,T) 

for some point 9* G G, where V = F~ l Q(F~ 1 ) T , F = dh(9*)/d9 is negative 
definite, Q = lirm^oo E(e k e^), and e k is as defined in (4). 
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Below we consider the asymptotic efficiency of 9k- As already mentioned, 
the asymptotic efficiency of the trajectory averaging estimator has been 
studied by quite a few authors. Tang, L'Ecuyer and Chen (1999) gives the 
following definition for the asymptotic efficient estimator that can be re- 
sulted from a stochastic approximation algorithm. 

Definition 2.1. Consider the stochastic approximation algorithm (2). 
Let {Z n } n >Q, given as a function of {0 n }n>o, be a sequence of estimators of 
9*. The algorithm {Z n } n >o is said to be asymptotically efficient if 

(16) V^(Zn-9*)^N(0,F- 1 Q(F~ 1 ) T ), 

where F = dh(y* )/dy, and Q is the asymptotic covariance matrix of (1/ y/n) x 

As mentioned in Tang, L'Ecuyer and Chen (1999), Q is the smallest 
possible limit covariance matrix that an estimator based on the stochas- 
tic approximation algorithm (2) can achieve. If 9k — > 9* and {£k} forms 
or asymptotically forms a martingale difference sequence, then we have 
Q = linifc-s.oo E(ek£k)- ^ n ^ ne nex t theorem, we show that the asymptotic 
covariance matrix Q established in Theorem 2.2 is the same as Q, and thus 
the trajectory averaging estimator 9 k is asymptotically efficient. 

Theorem 2.3. Assume conditions (Ai), (A 2 ), (A 3 ) and (A 4 ) hold. Let 
Xq C X he such that sup x£Xo V{x) < oo and that /Co C Vm , where Vm is 
defined in (Ai). Then 9^ is asymptotically efficient. 

As implied by Theorem 2.3, the convergence rate of 9k, which is measured 
by the asymptotic covariance matrix T, is independent of the choice of the 
gain factor sequence as long as the condition (A4) is satisfied. The asymp- 
totic efficiency of 9k can also be interpreted in terms of Fisher information 
theory. Refer to Pelletier [(2000), Section 3] and the references therein for 
more discussions on this issue. 

Trajectory averaging enables smoothing of the behavior of the algorithm 
but at the same time, it slows down the numerical convergence because it 
takes longer for the algorithm to forget the first iterates. An alternative 
idea would be to consider moving window averaging algorithms, see, for 
example, Kushner and Yang (1993) and Kushner and Yin (2003), Chapter 
11. Extension of their results to the general stochastic approximation MCMC 
algorithm will be of great interest. 
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3. Trajectory averaging for the stochastic approximation Monte Carlo 
algorithm. 

3.1. The SAMC algorithm. Suppose that we are interested in sampling 
from the following distribution 

(17) f(x) = cijj(x), xeX, 

where c is an unknown constant, X C M. dx is the sample space. The basic 
idea of SAMC stems from the Wang-Landau algorithm [Wang and Landau 
(2001), Liang (2005)] and can be briefly explained as follows. Let Ei, . . . , E m 
denote a partition of X , and let Wj = J E . ip(x) dx for i = 1, . . . , m. SAMC 
seeks to draw sample from the trial distribution 

(!8) fU x ) K V I {x£E i }, 

1=1 

where 7Tj's are prespecified constants such that 7Tj > for all i and TTi = 1, 
and I^ x& Ei} = 1 i£ x £ Ei and otherwise. For example, if the sample space 
is partitioned according to the energy function into the following subregions: 
Ei = {x:-log(ip(x)) < ui}, E 2 = {x-.ux < -log(^(x)) < u 2 },...,E m = 
{x : — log(ip(x)) > u m -i}, where — oo < u\ < • ■ • < u rn ^\ < oo are the user- 
specified numbers, then sampling from f w {x) would result in a random walk 
(by viewing each subregion as a "point") in the space of energy with each 
subregion being sampled with probability 7Tj. Here, without loss of generality, 
we assume that each subregion is unempty; that is, assuming J E ip(x) dx > 
for all i = 1, ... ,m. Therefore, sampling from (18) essentially avoids the 
local-trap problem suffered by the conventional MCMC algorithms. This 
is attractive, but cjj's are unknown. SAMC provides a dynamic way to es- 
timate Wj's under the framework of the stochastic approximation MCMC 
algorithm. 

In what follows we describe how uj can be estimated by SAMC. Since 
fuj(x) is invariant with respect to a scale change of uj, it suffices to estimate 

(i) 

uji, . . . ,uj m ^i by fixing uj rn to a known constant provided uj m > 0. Let 6 k 
denote the working estimate of log(ajj/7Tj) obtained at iteration k, and let 

®k = (#j^ , • • • , 6%™ ^ ) • Why this reparameterization is used will be explained 
at the end of this subsection. Let {a&} denote the gain factor sequence, and 
let {/C s , s > 0} denote a sequence of compact subsets of © as defined in (3). 
For this algorithm, {IC s ,s > 0} can be chosen as follows. Define 

(19) ^) = -lo g fl-^EYT- 7r 0l' 
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where Si = J E . ip(x) dxj exp(^W) for i = 1, . . . , m — 1, and S = Yli^i &i + 
J E .ip(x) dx. Clearly, v(8) is continuous in 6, and Vm — {0-v(6) < M} for 
any M G (0, oo) forms a compact subset of G. Therefore, {Vm s ,s > 0}, < 
Mo < Mi < • ■ •, is an appropriate choice of {/C s , s > 0}. For the SAMC algo- 
rithm, as seen below, ||fl'(0 fc ,.X fc+1 )|| = \\(I {xk+l£El} - m, . . . ,I {xk+ieEm _ l} - 

^m.-i) T \\ is bounded by the constant s/2, so we can set the drift function 
V(x) = 1. Hence, the initial sample xo can be drawn arbitrarily from Xq = X, 
while leaving the condition sup x£Xo V(x) < oo holds. In summary, SAMC 
starts with an initial estimate of 0q S /Co > and a random sample drawn arbi- 
trarily from the space X, and then iterates between the following steps. 
SAMC algorithm. 

(a) (Sampling.) Simulate a sample x k+ \ by a single MH update with the 
target distribution 

(2°) f0 k (x)0C22 -^rh^Ei} + Hx)I{x&E m }, 

provided that E m is nonempty. In practice, E m can be replaced by any 
other unempty subregion. 

(a.l) Generate y according to a proposal distribution q{x^y). 
(a. 2) Calculate the ratio 

g (j(* fc »_ (j(y)) ip(y)g(y,x k ) 
ip{x k )q{x k ,y) 

where J(z) denotes the index of the subregion that the sample z 
belongs to. 

(a.3) Accept the proposal with probability min(l,r). If it is accepted, 
set x k+ \ = y, otherwise, set Xk+\ = x k . 

(b) (Weight updating.) Set 

( 21 ) 9 i ) +l/2 = 6 k ) + a k+l{I{x k+1 (LE l } -7Ti), t = l,...,m-l. 

(c) (Varying truncation.) If 6 k+1 / 2 G tC ak , then set (6 k+1 ,x k+1 ) = (9 k+1 / 2 ,x k +i) 
and a k+ i = a k ; otherwise, set (6 k+ i,x k+ i) = T(9 k ,x k ) and a k+1 = <x fe + l, 
where a k and 7~(-, •) are as defined in Section 2. 

SAMC sampling is driven by its self-adjusting mechanism, which, conse- 
quently, implies the superiority of SAMC in sample space exploration. The 
self-adjusting mechanism can be explained as follows: if a subregion is vis- 
ited at iteration k, 9 k will be updated accordingly such that the probability 
that this subregion (other subregions) will be revisited at the next itera- 
tions will decrease (increase). Mathematically, if x&+i € Ei, then ^^.+ 1 / 2 4— 
9 k + afc+i(l — 7Tj) and ^+i/ 2 ^~ 9^ — a k+ i-Kj for j ^ i. Note that the linear 
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adjustment on 9 transforms to a multiplying adjustment on oj. This also ex- 
plains why SAMC works on the logarithm of oj. Working on the logarithm 
enables oj to be adjusted quickly according to the distribution of the sam- 
ples. Otherwise, learning of oj would be very slow due to the linear nature 
of stochastic approximation. Including 7Tj in the transformation log(wj/7Tj) 
facilitates our computation, for example, the ratio r in step (a. 2). 

The self-adjusting mechanism has led to successful applications of SAMC 
for many hard computational problems, including phylogenetic tree recon- 
struction [Cheon and Liang (2007, 2009)], neural network training [Liang 
(2007b)], Bayesian network learning [Liang and Zhang (2009)], among oth- 
ers. 



3.2. Trajectory averaging for SAMC. To show that the trajectory aver- 
aging estimator is asymptotically efficient for SAMC, we assume the follow- 
ing conditions. 

(Ci) The MH transition kernel used in the sampling step satisfies the drift 
condition (A3). 

To ensure the drift condition to be satisfied, Liang, Liu and Carroll (2007) 
restrict the sample space X to be a compact set, assume f(x) to be bounded 
away from and 00, and choose the proposal distribution q(x,y) to satisfy 
the local positive condition: for every x G X, there exist positive e\ and £2 
such that 

(22) \\x-y\\<£i => q(x,y)>e 2 - 

If the compactness condition on X is removed, we may need to impose some 
constraints on the tails of the target distribution f(x) and the proposal 
distribution q(x,y) as done by Andrieu, Moulines and Priouret (2005). 

(C2) The sequence {ak} satisfies the following conditions: 

/ ak = 00, lim (kak) = 00, 
fe=l 

00 (l+r)/2 

= o{a k+1 ), > 7^ <0 ° 

a k ^ Vk 

for some r G (0, 1]. 

For the SAMC algorithm, as previously discussed, ||ii'(0fc,.Xfc + i)|| is bounded 
by the constant \/2, so we can set V(x) = 1 and set a to any a large number 
in condition (A3). Furthermore, given a choice of a k = 0(k~ v ) for some 77 £ 

(1/2, 1), there always exists a sequence for example, bk = 2a k 1+T ^ 2 for 
some r G (0, 1], such that the inequality ||#fc+i/2 — @k\\ = \\o-kH(6k,Xf : +i)\\ < 
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b k holds for all iterations. Hence, a specification of the sequence {bk} can be 
omitted for the SAMC algorithm. 

Theorem 3.1 concerns the convergence of SAMC. In the first part, it states 
that k (7a is almost surely finite; that is, {0k} can be included in a compact 
set almost surely. In the second part, it states the convergence of Ok to a 
solution 8*. We note that for SAMC, the same convergence result has been 
established by Liang, Liu and Carroll (2007) under (Ci) and a relaxed con- 
dition of (C2), where {a^} is allowed to decrease at a rate of 0(l/k). Since 
the focus of this paper is on the asymptotic efficiency of k , the convergence 
of {Ok} is only stated under a slower decreasing rate of {a^}. We also note 
that for SAMC, we have assumed, without loss of generality, that all subre- 
gions are unempty. For the empty subregions, no adaptation of {0k} occurs 
for the corresponding components in the run. Therefore, the convergence 
of {Ok} should only be measured for the components corresponding to the 
nonempty subregions. 

Theorem 3.1. Assume conditions (Ci) and (C2) hold. Then there ex- 
ists (a.s.) a number, denoted by a s , such that k as < 00, k as+ i = 00, and {Ok} 
given by the SAMC algorithm has no truncation for k > k as! that is, 

(23) k+1 = k + a k H(0 k ,x k+ i) V/c > k as 
and 

(24) k ^0* a.s., 

where H(0 k ,x k+ i) = {I{ Xk+1 eE x } - tti, •■ ■ , I{x k+1 eE m ^} ~ x m -i) T , and 0* = 

(log(wi/7ri) - log(w m /7T m ),. . .,log(w m _l/7T m _l) - log(w m /7T m )) T . 

Theorem 3.2 concerns the asymptotic normality and efficiency of k . 

Theorem 3.2. Assume conditions (Ci) and (C2). Then 0k is asymp- 
totically efficient; that is, 

Vk(6k-0*)—>N(O,T) as k ^00, 

where T = F~ 1 Q(F~ 1 ) T , F = dh(0*)/d0 is negative definite and Q = 
lim fc _ Kx) £;(e fe e£). 

The above theorems address some theoretical issues of SAMC. For prac- 
tical issues, please refer to Liang, Liu and Carroll (2007), where issues, such 
as how to partition the sample space, how to choose the desired sampling 
distribution, and how to diagnose the convergence, have been discussed at 
length. An issue particularly related to the trajectory averaging estimator is 



14 



F. LIANG 



the length of the burn-in period. To remove the effect of the early iterates, 
the following estimator: 



instead of 9^, is often used in practice, where ko is the so-called length of 
the burn-in period. It is obvious that the choice of &o should be based on 
the diagnosis for the convergence of the simulation. Just like monitoring 
convergence of MCMC simulations, monitoring convergence of SAMC sim- 
ulations should be based on multiple runs [Liang, Liu and Carroll (2007)]. 
In practice, if only a single run was made, we suggest to look at the plot 
of 7? to choose ko from where 7?^ has been approximately stable. Here, we 
denote by 7?^ the sampling frequencies of the respective subregions realized 
by iteration k. It follows from Theorem 3.1 that 7?j. — > it when the number 
of iterations, k, becomes large. 

Trajectory averaging can directly benefit one's inference in many appli- 
cations of SAMC. A typical example is Bayesian model selection, where the 
ratio Ui/ujj just corresponds to the Bayesian factor of two models if one 
partitions the sample space according to the model index and imposes an 
uniform prior on the model space as done in Liang (2009). Another example 
is inference for the spatial models with intractable normalizing constants, for 
which Liang, Liu and Carroll (2007) has demonstrated how SAMC can be 
used to estimate the normalizing constants for these models and how the es- 
timate can then be used for inference of the model parameters. An improved 
estimate of the normalizing constant function would definitely benefit one's 
inference for the model. 

4. Trajectory averaging for a stochastic approximation MLE algorithm. 

Consider the standard missing data problem: 

• y is the observed incomplete data. 

• f(x,0) is the complete data likelihood, that is, the likelihood of the com- 
plete data (x,y) obtained by augmenting the observed data y with the 
missing data x. The dependence of f(x,9) on y is here implicit. 

• p(x,9) is the predictive distribution of the missing data x given the ob- 
served data y, that is, the predictive likelihood. 

Our goal is to find the maximum likelihood estimator of 9. This problem 
has been considered by a few authors under the framework of stochastic 
approximation; see, for example, Younes (1989), Gu and Kong (1998) and 
Delyon, Lavielle and Moulines (1999). A basic algorithm proposed by Younes 
(1989) for the problem can be written as 




(25) 



8k+i — Qk + o-kdelog f(X k+ i,9k), 
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where the missing data Xk+\ can be imputed using a MCMC algorithm, such 
as the Metropolis-Hastings algorithm. Under standard regularity conditions, 
we have 

h(9) = Ee[d e log f(X,9)] = d e l(9), 

where 1(9) is the log- likelihood function of the incomplete data. 

To show that the trajectory averaging estimator is asymptotically efficient 
for a varying truncation version of the algorithm (25), we assume (A3), (A4) 
and some regularity conditions for the distribution f(x,9). The conditions 
(Ai) and (A2) can be easily verified with the following settings: 

• The Lyapunov function v(9) can be chosen as v(9) = —1(9) + C, where C 
is chosen such that v(9) > 0. Thus, 

(Vv(9),h(9)) = -\\d e l(9)f. 

The set of stationary points of (25), {9 : (Vv(9),h(9)) = 0}, coincides with 
the set of the solutions {9:d$l(9) = 0}. Then the condition (Ai) can be 
verified by verifying that 1(9) is continuously differentiable (this is problem 
dependent). 

• The matrix F trivially is the Hessian matrix of 1(9). Then (A2) can be 
verified using the Taylor expansion. 

In summary, we have the following theorem. 

Theorem 4.1. Assume conditions (A3) and (A4) hold. Then the es- 
timator 9^ generated by a varying truncation version of algorithm (25) is 
asymptotically efficient. 

In practice, to ensure the drift condition to be satisfied, we may follow 
Andrieu, Moulines and Priouret (2005) to impose some constraints on the 
tails of the distribution f(x,9) and the proposal distribution q(x,y). Al- 
ternatively, we can follow Liang, Liu and Carroll (2007) to choose a pro- 
posal satisfying the local positive condition (22) and to restrict the sample 
space X to be compact. For example, we may set X to a huge space, say, 
X = [— 10 100 , lO 100 ]^ . As a practical matter, this is equivalent to setting 
X = R d *. 

5. Conclusion. In this paper, we have shown that the trajectory averag- 
ing estimator is asymptotically efficient for a general stochastic approxima- 
tion MCMC algorithm under mild conditions, and then applied this result 
to the stochastic approximation Monte Carlo algorithm and a stochastic 
approximation MLE algorithm. 

The main difference between this work and the work published in the 
literature, for example, Polyak and Juditsky (1992) and Chen (1993), are 
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at the conditions on the observation noise. In the literature, it is usually 
assumed directly that the observation noise has the decomposition = 
e k + v ki where {e^} forms a martingale difference sequence and is a higher 

1/2 

order term of o(a k ). As shown in Lemma A. 5, the stochastic approximation 
MCMC algorithm does not satisfy this decomposition. 

APPENDIX A: PROOFS OF THEOREMS 2.2 AND 2.3 

Lemma A.l is a partial restatement of Proposition 6.1 of Andrieu, Moulines 
and Priouret (2005). 

Lemma A.l. Assume condition (A3) holds. Then the following results 
hold: 

(Bi) For any 9 £ Q, the Markov kernel Pq has a single stationary distri- 
bution fg. In addition, H:Q x X — > O is measurable for all 9 S O, 
f x \\H(9,x)\\f e (x)dx <oo. 

(B2) For any 9 £ O, the Poisson equation u(9,x) — Pgu(9,x) = H(9,x) — 
h(9) has a solution u(9,x), where Pqu(9,x) = J x u(9, x')Pg(x, x') dx' . 
There exist a function V : X — > [1, 00) such that {x € X, V(x) < 00} / 
0, and a constant f3 £ (0, 1] such that for any compact subset K, C O, 
the following holds: 

(i) sup\\H(9,x)\\v < 00, 
9eK 

(ii) Bup(||u(0,aO||v + \\Peu(9,x)\\ v ) < 00, 

(26) 

(hi) sup \\9-9'\\-P(\\u(9,x)-u(9',x)\\ v 

+ \\Peu{6,x)-P e ,u{6',x)\\ v )<< X . 

Lemma A. 2 is a restatement of Proposition 5.1 of Andrieu, Moulines and 
Priouret (2005). 

Lemma A. 2. Assume conditions (Ai), (A3) and (A4) hold. Let Xq C X 
be such that sup xeXQ V(x) < 00 and that /Co C Vm , where Vo is defined 
in (Ai). Then sup k E[V a (Xk)I(k > k as )] < 00, where a > 2 is defined in 
condition (A3) and k as is defined in Theorem 2.1. 

Lemma A. 3 is a restatement of Corollary 2.1.10 of Duflo (1997), pages 46 
and 47. 

Lemma A. 3. Let {S n i,G n i,l <i< k n ,n > 1} be a zero-mean, square- 
integrable martingale away with diffcvcficcs Vni? 

where Q n i denotes the a- 
field. Suppose that the following assumptions apply: 
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(i) The a -fields are nested: Q n i C G n +i t i for 1 < i < k n , n> 1. 

(ii) X^i=i ^(^m^nil&M-i) — ^ ^ * n probability, where A is a positive def- 
inite matrix. 

(iii) For any e > 0, Yh=i E [\\ v ni\\ 2 I(\\ Vni \\>e)\Gn,i-i] m probability. 
Then S nkn = Yli=i v m N(0, A) in distribution. 

Definition A.l. For g £ (0, oo), a sequence {X n ,n > 1} of random vari- 
ables is said to be residually Cesaro ^-integrable [RCI(fj), in short] if 

1 n 

sup — y E\Xi\ < oo 
n>l ™ r-f 

and 



1 - 

lim - J pE(\X i \-i e )I(\X i \>i 8 ) = 0. 



n— >oo Ti 

t=l 



Lemma A. 4 is a restatement of Theorem 2.1 of Chandra and Goswami 
(2006). 

Lemma A. 4. Let {X n ,n > 1} 6e a sequence of nonnegative random vari- 
ables satisfying E(XiXj) < E(Xi)E(Xj) for all i^ j and let S n = Yli=i -^i- 
If{X n ,n>l} is RCI(^) for some g £ (0, 1), then 

— \S n — E(S n )) — >• in probability, 
n 

Lemma A. 5. Assume conditions (Ai), (A3) and (A4) hold. Let Xq C X 
be such that sup xg ^ V(x) < 00 and that ICq C Vm > where Vo is defined in 
(Ai). 1/ /c CTs < 00, which is defined in Theorem 2.1, then there exist M. d - 
valued random processes {e k } k > ktTs , {^k}k>k aa and {^k}k>k lTB defined on a 
probability space (fi,.F,P) such that: 

(i) e k = e k + v k + c; k for k>k aa . 

(ii) {e k }k>k rTs ^ a martingale difference sequence, and -^Y^k=k a &k — ^ 
N(0,Q) in distribution, where Q = limjfe_ > . 00 E{e k e\). 

(iii) 7^E J =fc (Ta - E ll^ll^0, as k^ 00. 

(iv) E\\ J2i=k as a ^i\\ ->■ 0, as A; -> 00. 

Proof, (i) Let e fc(Ts = i/ fc(Ts = ^ = 0, and 

e k+1 = u(9 k ,x k+1 ) - Po k u(6 k ,x k ), 

v k +i = [Po k+1 u(O k+1 ,x k+ x) - Pe k u{9 k ,x k+1 )] 
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(27) + Q ^~a fc+1 

4+i = a k+ iPg k u(8 k ,x k ), 

It is easy to verify that (i) holds by noticing the Poisson equation given in 
(B 2 ) ; 

(ii) By (27), we have 

E(e k+X | JF k ) = E(u(8 k ,x k+1 )\JF k ) - Pg k u(8 k ,x k ) = 0, 

where {F k } k > kas is a family of cr-algebras satisfying a{9 k(7s , x kas } C T 
and a{8 krjs , flfc^+i, • • . , 6> fe ; x k<Js , x fc(Ts+1 , . . . , x k } C .F fe C J" fe+1 for all k>k as . 
Hence, {e k } k > kcrs forms a martingale difference sequence. 

When k as < oo, there exists a compact set 1C such that 9 k G /C for all 
k > 0. Following from Lemmas A.l and A. 2, {efc}fc>fc CT is e& is uniformly 
square integrable with respect to k, and the martingale s n = ^fc=i e fc * s 
square integrable for all n. 

By (27), we have 

E(e k+l e k+1 \F k ) = E[u(8 k ,x k+1 )u(8 k , x k+1 ) T \T k ] 

(28) - P dk u(8 k ,x k )Pe k u(8 k ,x k ) T 

= K°k,x k ). 

Following from Lemmas A.l and A. 2, ||Z(0fc,a;fe)|| is uniformly integrable 
with respect to k. Hence, {l(8 k ,x k ), k > k as } is RCI(£>) for any g > (Defi- 
nition A.l). Since {E(e k+ \e^ +1 \T k ) — E(e k+ \e^ +l )} forms a martingale dif- 
ference sequence, the correlation coefficient Corr(Z(0j,Xj), l(6j,Xj)) = for 
all i^j- By Lemma A. 4, we have, as n — > oo, 
^ n ^ n 

(29) - V l(9 k ,x k )^- V El(9 k ,x k ) in probability. 
n ^-^ n ^-^ 

k — k(j g k — k(j g 

Now we show that El(8 k ,x k ) also converges. It follows from (Ai) and (B2) 
that 1(8, x) is continuous in 8. By the convergence of 8 k , we can conclude 
that l(8 k ,x) converges to 1(8* ,x) for any x G X . Following from Lemmas A.l, 
A. 2 and Lebesgue's dominated convergence theorem, El(8 k ,x k ) converges to 
El(8* ,x). Combining with (29), we obtain 

1 ™ 

(30) - V l(8 k ,x k )^El(8*,x)= lim E(e k eT) in probability. 

n fc->oo 

k' = k ( ja 
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Since ||efc|| can be uniformly bounded by an integrable function cV(x), the 
Lindeberg condition is satisfied, that is, 



£ E 



\ e i\\ 2 

_J— J (||ei[|/VS>6) 



n 



as n — > oo. 



Following from Lemma A. 3, we have Y27=k 6i I — ^ -^(0> Q) ^y identifying 
ei/y/n to v ni , n to k n , and Ti to Q ni . 

(iii) By condition (A4), we have 

Ofc+2 — / \ 

= o(a k+2 ). 

By (27) and (26), there exists a constant c\ such that the following inequality 
holds: 

IN+illv < ci||^ + i -Qk\\ +o(a k+2 ) = c 1 \\a k H(9 k ,x k+1 )\\ + o(a k+2 ), 
which implies, by (26), that there exists a constant c 2 such that 

(31) ||i/ fc+ i||y2 < c 2 a k . 

Since V(x) is square integrable, v k is uniformly integrable with respect to k 
and there exists a constant C3 such that 

00 _,, I, 00 

EM* ^ a k „ 

where the last inequality follows from condition (A4). Therefore, (iii) holds 
by Kronecker's lemma. 

(iv) A straightforward calculation shows that 

k 

a^i = -4+1 = -a k+ iPe k u{9 k ,x k ). 

i = h(j s 

By Lemmas A.l and A. 2, E\\Po k u(9 k , x k )\\ is uniformly bounded with respect 
to k. Therefore, (iv) holds. □ 

By Theorem 2.1, we have 

(32) 6 k+l -9* = (9 k - 9*) + a k h(9 k ) + a k e k+l Vfc > k as . 

To facilitate the theoretical analysis for the random process {9 k }, we define 
a reduced random process: {9 k } k > , where 



(33) 9 k 



9 k + ?k, k>k aa , 
9 k , 0<k<k o 
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which is equivalent to set q k = for all k = 0, . . . , k ag . For convenience, we 
also define 



(34) e k = e k + u k , 

It is easy to verify that 

% +1 -e* = (I + a h F)(0 k -0*] 



k > k a . 



(35) 
which implies 



'k+l 



+ a k {h{6 k ) - F{6 k - 6*)) + a k e k+l Vfc > k as , 



(36) 



j = ka 



3 = k(Ts 



where <3>fcj = Y\i = j(I + a iF) if > j, and = I, and / denotes the 

identity matrix. 

For 7 specified in (A2) and a deterministic integer ko, define the stopping 
time n = min{j : j > k , \\6j -9*\\> 7} if \\6 ko -6*\\ < 7 and if ||6» fco -6*\\> 
7. Define 

(37) A = {i: k as < k < i < /*}, 

and let I A (k) denote the indicator function; I A (k) = 1 if A; 6 A and other- 
wise. Therefore, for all k>ko, 

(e k +i-e*)i A (k + i) 

(38) =$ kM (9 ko -e*)I A (k + l) + 



Yl ®k,j+iaje j+1 I A (j) 
■j=ko 



I A {k + l) 
I A (k + l). 



-j=ko 

Including the terms Ia(j) in (38) facilitates our use of some results published 
in Chen (2002) in the later proofs, but it does not change equality of (38). 
Note that if I A (k + 1) = 1, then I A (j) = 1 for all j = ko,...,k. 

Lemma A. 6. (i) The following estimate takes place: 



(39) 



a ; 



a k 



exp 



Vfc>j,Vj>l, 



1=3 



where o(l) denotes a magnitude that tends to zero as j — > 00. 
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(ii) Let c be a positive constant, then there exists another constant c\ 
such that 



k I k \ 

(40) ^<exp( -c ^ aj j <ci Vfc>l,Vr>l. 

i=l V j=i+l / 

(iii) There exist constants cq > and c > suc/i i/iai 

(41) Ufcfcjll <c exp|-c^ai| Vfc>j,Vj>0. 

(iv) Lei = I]i=j( a j-i ~ a j)$i-i,j ^en 67 fc j is uniformly 
bounded with respect to both k and j for 1 < j < k, and 

1 k 

(42) -yj||g fc; j|| — >0 ask^-oo. 

3=1 

Proof. Parts (i) and (iv) are a restatement of Lemma 3.4.1 of Chen 
(2002). The proof of part (ii) can be found in the proof of Lemma 3.3.2 of 
Chen (2002). The proof of part (iii) can be found in the proof of Lemma 
3.1.1 of Chen (2002). □ 



Lemma A. 7. If conditions (Ai)-(A4) hold, then 

' -E\\{9 k+1 -e*)i A (k + i)f 



is uniformly bounded with respect to k, where the set A is as defined in (37). 

Proof. By (33) and (27), we have 
i i 

3* 1 1 2 \\n /)* ~ 1 1 2 



k+1 — & || — ||tffc+l — & — ?fc+l| 



a k+l a k+l 

2 

ak+i 



< — II 2 + 2a k+1 \\Po k u(O k ,x k ) 



2 



Following from (B2) and Lemma A. 2, it is easy to see that E\\PQ k u{9 k ,x k 
is uniformly bounded with respect to k. Hence, to prove the lemma, it suf- 
fices to prove that -^E\\(9 k+ i - 9*)I A (k + 1)|| 2 is uniformly bounded with 
respect to k. 

By (33), (A2) and (B2), there exist constants c\ and C2 such that 

\\h(9 j )-F(9 j -9*)\\i A U) 

= \\h(9 j )-F(9 j -9*)-F,~ j \\I A (j) 

(43) 

< \\h(9j) - F(0j - 9*)\\I A (j) + c 2 a j \\Pe j _ 1 u(Gj-i^j-i)\\ 

< Cl \\9j - 9*\\ l+ P + c 2 a j \\P 0j _ 1 u(9 j . l ,x^ l )\\. 
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In addition, we have 



(44) 



E\\6 ko -e*\\'i A (k ) = E\\e k() -e* + ^ \\ 2 i A (k ) 



<2\\e ko -8*\\ 2 I A (ko) + 2E\\s ko \ 



It is easy to see from (26) and (27) that q ko 1S square integrable. Hence, 
following from (37), there exists a constant 7 such that 



(45) 



E\\9 ka -0*\\ 2 I A (k o )<j. 



By (38), (41), (43) and (45), and following Chen [(2002), page 141] we 
have 



1 



-E\\(e k+1 -e*)i A (k + i) 



^ 5cq7 / 
< exp —2c } ai 

n.i. 1 1 1 



i=k 



K r 2 k k 
OCp 

t= ko j= ko 

K r 2 k k 
5C Q 

a k+l ^ ^ 



+ 



i= ko j=ko 
k k 

2 EH 



exp I -c ^2 a s J a? exp I -c ^ a s J ai\\Ee i+1 eJ +1 \ 

V s=j+l / V s=i+l ) 

exp -c ^2 a * Ujexp -c ^ a s \ aiE\\u i+1 uJ +1 \ 



hclc 2 



a k+l 



i=ko j=ko 



s=j+l 
k 



s=i+l 
k 



exp — c a s a 2 exp — c 



s=i+l 



+ 



OC C 1 jj~ 
a fc+l 



x ai£ , ||Pe i _ 1 n(6 , i _i,a; i _i)(P 03 _ 1 u(6' i „i,a; : ,-_i)) J 



^exp -c J2 a s )a j \\e j -e*\\ l+ Pl A (j) 



.j=ko 



s=j+l 



A 



h + h + h + h + h- 
By (39), there exists a constant C3 such that such that 



< — ° exp I o(l) ^2 a i J ex P f — 2c 



j=fen 



j=fen 



where o(l) — )■ as ko — > 00. This implies that o(l) — 2c < if /co is large 
enough. Hence, I\ is bounded if ko is large enough. 
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By (39) and (40), for large enough ko, there exists a constant C4 such that 



(46) Y exp ( -c Y a s )<Y a j ex P ( ~\ Y a A - c ^ 



s=j+l 



j=ko 



=j+l 



Since {e^} forms a martingale difference sequence (Lemma A. 5), 

E ei ej = E{E(e i \T i _ 1 )ej) = Vi > j, 
which implies that 



E 

i=k 



a 2 exp —2c Y2 a s J 



s=j+l 



< 5cgsupi?||ei|| 2 Y] 

i=fc 



af exp j 



s=j+l 



Since {||ej||,i > 1} is uniformly bounded by a function cV{x) which is square 
integrable, supj i?||ej|| 2 is bounded by a constant. Furthermore, by (40), I2 
is uniformly bounded with respect to k. 

By (27), (26) and condition (A4), there exist a constant cq and a constant 
r € (0, 1) such that the following inequality holds: 

(47) |K+i||v <c \\e k+1 -0 k \\ +o{a k+2 ) < c b k + o(a k+2 ) = 0(o^ +t) ' 2 ). 
This, by (Bi) and the Cauchy-Schwarz inequality, further implies that there 



exists a constant c' Q such that 



(48) 



E\\v i+1 v j+1 \\ < c a\ 



, (l+r)/2 (l+r)/2 



a ■ 



Therefore, there exists a constant C5 such that 



i=k j=k 



exp 



x exp j 



k k 

< &fc Y Y 

i=ko j=ko 



-c Y a s 

f k N 

-c ^ a s 
V s=i+l / 
/ fc 



exp 



E 

8=3+1 

k 



a, a 



1/2 



')C(a 



x exp 



Y a s )a i ^a^y 2 af + ^ 2 



s=i+l 



5c l c A Y 



a/^exp 



k 

Y as 

8=3' +1 
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By (40), ^3 is uniformly bounded with respect to k. 

Following from Lemmas A.l and A. 2, E\\Pg i _ 1 u(6i-i,Xi-i)(PQ j _ 1 u(9j-i, 
Xj-i)) T \\ is uniformly bounded with respect to k. Therefore, there exists a 
constant cq such that 

k k r / k 
h = 5clclc 6 "^2^2 exp I -c ^2 a s 

V s=j+l 



i=ko j=ko 
{j=ko 



exp 



-c a s 

s=i+l 



3/2 



exp 



2 E a * 

s=j+l / 



By (40), I4 is uniformly bounded with respect to k. 

The proof for the uniform boundedness of I§ can be found in the proof of 
Lemma 3.4.3 of Chen (2002), pages 143 and 144. □ 



Lemma A. 8. If conditions (Ai)-(A4) hold, then as k—^oo, 

k 

in probability. 



Proof. By (33) and (27), there exists a constant c such that 



1 fe 



<^ E nw-w-nn + -^ E oiii^-xu^i-i.x^o 



A 



To prove the lemma, it suffices to prove that ii and both converge to zero 
in probability as k — >• 00. 

Following from Lemmas A.l and A. 2, £<||P0 «($&, x)|| is uniformly bounded 
for all k > /c^ . This implies, by condition (A4), there exists a constant c such 
that 



Eaj£ , ||P6> ! _i^(#j-i,£j-i)|| a i 
7= <c) — = < 00. 
^ Vi 



i=l 



By Kronecker's lemma, E(l2) —> 0, and thus I2 — > in probability. 

The convergence Ji — >• can be established as in Chen [(2002), Lemma 
3.4.4] using the condition (A2) and Lemma A. 7. □ 



TRAJECTORY AVERAGING FOR SAMCMC 



25 



Proof of Theorem 2.2. By Theorem 2.1, 9 k converges to the zero point 
9* almost surely and 

6 k+ i = 6 k + a k H(6 k ,x k+ i) V/c>/c CTs . 

Consequently, we have, by (33), 

k 

Vk{9 k - 6*) = o(l) + -i= (*< 

(49) 



Vk _ 



where o(l) — )• as — > oo. 

Condition (A4) implies A^2n =ka a? — ► by Kronecker's lemma. Follow- 
ing Lemmas A.l and A. 2, there exists a constant c such that 

(50) 4e«^E^^ 



2 f\i(j & % r>J(7 



Therefore, ^= Tli=k a Q — ^ in probability as k — > 00. 
By (36), (49) and "(50), we have 

k 

Vk(h - 9*) = op(i) + ^ E *<-i,^ - r ) 

^ i— 1 

+ E E $ i-i,i+i a i^'+i 

i=k(j s j = ka s 

(51) 

fc i— 1 

+ 71 E E *i-ij + iaMOj)-noj-n) 

i = kcr s j = kcr s 

= o p {l) + h+I 2 + h, 

where o p (-) means 

^fc = o p (Z k ) if and only if Y k /Z k — > in probability, as k — >■ 00. 
By noticing that <&£j = $fc-i,j + a>kF&k—l ji we have 



=/ + E aii?$i - 1 J and F 1$ fcJ =F ^E^^- 1 
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and thus 

k k k 

By the definition of Gj-j given in Lemma A.6(iv), we have 

k 

(52) 0i _i = -F" 1 + G fcJj 



which implies 



-7= (~F 1 + Gk,k tTs )(0k tTs 



By Lemma A. 6, Gkj is bounded. Therefore, 1\ — > as k — > 00. The above 
arguments also imply that there exists a constant cq > such that 



(53) 

By (53), we have 



< c Vk,Vj<k. 



1 

co 



j=fc CTs i=j+l 



j = ka 



It then follows from Lemma A. 8 that ^3 converges to zero in probability as 
k — > 00. 

Now we consider I2. By (34) and (52), 

p— l ^ 1 ^ 

^ 6j+1 + "v^ ^ Gfc,j+ie j+ i 



v 7 ^ 

1 

Vk 



j—k(j s 
k 



j—k as 

+ ^fiZ (-F- 1 + G k , j+1 )v J+1 



j = ka 



= Jl + J 2 + J 3 - 

Since {e^} is a martingale difference sequence, 



E{ejGlfi Kj e 3 ) = E[E( ei \ F^xf Gifi^) =0 Vz > j 



T riT 
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-Ell^H 2 = t ^2 E (. e j+i G k,j+i G kj+iej+i) < - ^2 ||G feii+ i|| 2 £;||ej + i|| 2 . 

By the uniform boundedness of {i£||ei|| 2 ,i > k as }, (42) and the uniform 
boundedness of G k j, there exists a constant c\ such that 

k 

(54) E\\J 2 \\ 2 < j W G k,j+i\\ ^° asfc^oo. 

Therefore, J% — > in probability as k — > oo. 

Since G k j is uniformly bounded with respect to both k and j, there exists 
a constant C2 such that 

3 — ka s 

Following from Lemma A.5(iii), J3 converges to zero in probability as k — > 
00. 

By Lemma A. 5, J\ — > N(0,S) in distribution. Combining with the con- 
vergence results of I\, I3, J2 and J3, we conclude the proof of the theorem. 

Proof of Theorem 2.3. Since the order of s k is difficult to treat, we con- 
sider the following stochastic approximation MCMC algorithm: 

(55) k+1 = e k + a k (h(e k ) + e k+1 ), 

where {9 k } and {e k } are as defined in (33) and (34), respectively. Follow- 
ing from Lemma A.5(ii), {£k} forms a sequence of asymptotically unbiased 
estimator of 0. 

Let 9 n = ^2 k= \9 k /n. To establish that 9 is an asymptotically efficient 
estimator of 9* , we will first show (in step 1) 

(56) y/n(§-6*)->N(Q,T), 

where T = F" 1 Q(F" 1 ) T , F = dh{9*)/d9 and Q = linu^ E{e k el); and then 
show (in step 2) that the asymptotic covariance matrix of J2k=i^k/V^ 1S 
equal to Q. 

Step 1. By (34), we have 

1 n 

(57) = + -V>. 

k=l 
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By Lemmas A.l and A. 2, E\\Pg k _ 1 u(8k-i,Xk-i)\\ is uniformly bounded for 



k > k„ s and thus there exists a constant c such that 



E 



^ n ^ n 

~r= y~! * =£i "~r= e ^^-i 



_ 1 u(»jfc_i,a;fc_i / 



< 



Ofc. 



By Kronecker's lemma and (A4), we have Sfc=fc CT Ofc ~ ^ in probability. 
Hence, ^ Efe=fe CTs = °v( l ) and 



(58) 



That is 

(59) I n = ri + Op (n- 1 / 2 ). 

Following from Theorem 2.2 and Slutsky's theorem, (56) holds. 

Step 2. Now we show the asymptotic covariance matrix of Ylk=i^k/Vn 
is equal to Q. Consider 

£ (^£ & )(^£ & ) T -KfiH(£H T 

n 

E £ (« 



= iE^vt)+iEE^J)-^ 

fe=i i^j 
= (/ 1 ) + (/ 2 ) + (/ 3 ). 
By (34), we have 



,k=l 



E^ 



2 n 1 n 

( Ji ) = ^ E s ( e * e * ) + - E ^ e ^ T ) + - E ^fa"* ) 



k=l 



k=l 



k=l 



= (Jl) + (J 2 ) + (J 3 ). 

By (47), H^fc^/fllya = 0(a 1 k +T ) for k > k as , where r G (0, 1) is defined in (A4). 
Since y 2 (x) is square integrable, there exists a constant c such that 



1 n 

-J>|k^||< (l) + 



C 1_ 



E 

fc — kg- „ 



i 1+T 
l k ' 



which, by Kronecker's lemma and (A4), implies J3 — > as n — > 00. 

Following from Lemmas A.l and A. 2, {||efc||}fc>fc CT is uniformly bounded 
with respect to k. Therefore, there exists a constant c such that 
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Following from Lemma A.5(iii), J 2 — > as n — > 00. 

By (28), E(ek + ie£ +1 ) = El(9 k ,Xk). Since 1(9, x) is continuous in 9, it fol- 
lows from Theorem 2.1 that 1(0/., x) converges to 1(9*, x) for any x £ X. 
Furthermore, following from Lemma A. 2 and Lebesgue's dominated con- 
vergence theorem, we conclude that El(9k,X}.) converges to El(9*,x), and 
thus 

Ji -»• El(9*,x) = lim E(e k el) = Q. 

k— >oo 

Summarizing the convergence results of J\, J2 and J3, we conclude that 
(Ii) — > Q as n — )■ 00. 

By (34), iov j, i> k ag and j > A; CTs , we have 



E(iieJ) = E{(ei + z/;)(ej + z/,-)' J } = £(ejej + Vivj + + i^ej ) 



T 



(60) 



= E(uiuj), 

where the last equality follows from the result that {&k}k>k a is a martingale 
difference sequence [Lemma A. 5(h)]. By (48), there exists a constant c such 
that 



^W^iVj || S ca i a- , 



which implies that 

^EE^J) 



<o(l)+c 



(61) 

By Kronecker's lemma and (A4), X^r=fc CT 



n. n 

L V a a+r)/2 — V a 
/n * yfn ^ ■ 

i — k(j s J L j — k as 



,(l+r)/2 



,(l+r)/2 



/ y^n — > and thus 



(62) 



as n — f 00. 



In summary of (60) and (62), we have 

(63) (j 2 ) = l£££ (£ ~^)^o 



as n — > 00. 



By (47), there exists a constant c such that 



E^ 



fe=i 



<o 



^ n n 

(1)4 E E\\v k \\=o(l) + ^= a 
\/n z — ' \/n 



(l+r)/2 



By Kronecker's lemma and (A4), we have 



(64) 



1 



E^ 



/c=i 



as n — 7- 00. 



30 



F. LIANG 



By Lemma A.l(i) and (ii), where it is shown that {ek}k>k as is a martingale 
difference sequence, we have 



k=l 



s ^E(e k + v k ) 



n 



,fe=l 



Following from (64), we have (1%) — > as n — > 00. 

Summarizing the convergence results of (h), (I2) and (I3), the asymptotic 
covariance matrix of J2k=i^k/ ' \/n is equal to Q. Combining with (56), we 
conclude that 9 k is an asymptotically efficient estimator of 9*. 

Since 6 k and 6k have the same asymptotic distribution iV(0, T), 9 k is also 
asymptotically efficient as an estimator of 9*. This concludes the proof of 
Theorem 2.3. 



APPENDIX B: PROOFS OF THEOREMS 3.1 AND 3.2 

The theorems can be proved using Theorems 2.1 and 2.2 by showing that 
SAMC satisfies the conditions (Ai) and (A2), as (A3) is assumed, and (A4) 
and and the condition sup^g^ V{x) < 00 have been verified in the text. 

Verification of (Ai). To simplify notation, in the proof we drop the 

subscript k, denoting x k by x and denote 9 k = (0^ , ...,9^ n ^) by 6 = 
(9^\ . . . ,9^ m ~ 1 ^). Since the invariant distribution of the MH kernel is fe(x), 
we have for any fixed 9, 



E{I {X 



7T; 



X 



{I{xtE t }-Ki)fo{x)dx 



(65) 



Lijj(x)dx/ 



Si 
S ' 



0(i)i 



7T; 



7T; 



for i = 1, m— 1, where S{ = f E /ip(x) dx/e 8 * and S = YaL 1 Si + j E 
Therefore, 



ip(x) dx. 



h{9) 



H(9,x)f e (x)dx 



x 



Si 

s 



s 



m—1 

~s~ 



m—1 



It follows from (65) that h(9) is a continuous function of 9. Let A(9) 



1 _ 1 v^m— 1/ 
1 2 2_,,-=i I 



If Si 



1=1 ^ s 



and define v{9) = — log(A(0)) as in (19). As shown 
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below, v(9) is continuously differentiable. Since < ^YlT=i(~s~ ~ ^i) 2 < ~ 
lEJLl^f") 2 < 1 for all 9 G 9, v{6) takes values in the interval [0, oo). 

Solving the system of equations formed by (65), we have the single solution 



9® = c + log V(x) dxj - logfa), 



,m- 1, 



where c= — log(J E ^(x)(ix) + fog( 7r m)- It is obvious that v(9*) = 0, and 
v(£) has an empty interior, where 9* is specified in Theorem 3.1. Therefore, 
(Ai)(iv) is satisfied. 

Given the continuity of v(9), for any numbers M\ > Mq > 0, 9* G int(VMo)> 
and Vmi is a compact set, where hit (^4) denotes the interior of the set A. 
Therefore, (Ai)(i) and (Ai)(ii) are verified. 

To verify the condition (Ai)(iii), we have the following calculations: 



(66) 



8S 


dSi 






d(Si/S) 


Si 




s 



-Si 



dSi 8S 4 



0, 



d9d) 89® 

d(Sj/S) _ djSj/S) _ SjSj 
89(® 89U) S 2 

for i, j = 1, . . . ,m — 1 and i ^ j. Let b = YlY=i ^j/Si then we have 



l_El 
S 



dv{6) _ 
860) ~ 2A(9) 



1 



A(9) 
1 

W) 

1 

W) 



m—1 

E 
E 

'm—1 



djSj/S-TTj 



j i SiSj 



E(| 



S'i 



S, 



7T 



7T, 



Si 



S 2 



SiSj 



Ei 

5 



7T, 



S*, 



5, 
5 



s 1 _ 

s ni j s 



5 



ft 



S 



for i = 1, ... ,m — 1, where it is defined fj,£ = Y^JLi ( 



7T 



{Vv(0),h(0)) 
1 



AW 



(67) 



m—1 



A(0) 



i=i 

m—1 



Si 

s 



TV, 



Si. 
bS 



6 E(§-)§ 



i=i 



m—1 

«E 

i=l 



1.2 2 



El 

S 



7T, 



. Thus, 



Si 



bS 
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A(9) 



bo\ + 6(1 - %f ) < 0, 



where a 2 denotes the variance of the discrete distribution defined in the 
following table: 



State (£) ^ 
Prob. 



7Tl 



7T m -l 



If = 0*, (Vv(6),h(e)) = 0; otherwise, (Vv(9),h(9)) < 0. Therefore, (Ai)(iii) 
is satisfied. 

Verification of (A2). To verify this condition, we first show that h{6) has 
bounded second derivatives. Continuing the calculation in (66), we have 



d 2 {s l /s) _ s, 



s 



1 



Si 

s 



1 



2S 
S 



d 2 {Sj/S) 
d6<J) 86^ 



SjSj ■ 

" s 2 1 



25^ 

s 



which implies that the second derivative of h{6) is uniformly bounded by 
noting the inequality < % < 1. 



Let F = dh{6)/d0. By (66), we have 



F 



Si 

s 



V 



S2S1 
s 2 



Sm—lSl 

s 2 



SIS2 

^2 Sj 

' s { ~ s 



S\S> 



m—1 



s 2 

S2S m -l 



s 2 



Sm—1 ( j S m —i 



s 



s J 



Thus, for any nonzero vector z = (z\, . . . , Zm-i) 1 
"m—1 

z2 z 



z T Fz 



E2 Si 



SiY 



1=1 

m— 1 



5' 



„2 Sj 



i=l 
/ m—1 



s 



Si 



6(1-6) ]Tz, 



sA 2 



65 



z hs \^ Zi bs 

=1 \ t=i 

= -6Var(Z) - 6(1 - b){E{Z)) 2 < 0, 

where E(Z) and Var(Z) denote, respectively, the mean and variance of the 
discrete distribution defined by the following table: 

State (Z) z\ • ■ ■ z m -i 

Sm — 1 



Prob. 



Si 
bS 



bS 



This implies that the matrix F is negative definite and thus stable. Ap- 
plying Taylor's expansion to h(8) at the point 8*, we have 

\\h{6) - f{6 - e*)\\ < c \\e-e*\\ 1+p , 
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for some constants p G (0, 1] and c > 0, by noting that h{9*) = and that 
the second derivatives of h(9) are uniformly bounded with respect to 6. 
Therefore, (A2) is satisfied. 
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